Trees and Random Forests

Author: Anthony Strittmatter

We analyse the browsing and online purchasing behaviour of households using Comscore’s web browser data. The data file browser_2006.csv contains 1,500 households that spent at least 1 US dollar online in 2006. The variable spend is a household's online spending (in US dollars). The data also contain the browsing history of each household for the 1,000 most heavily trafficked websites (see the list of websites in browser-sites.txt), recorded as the percentage of total online time spent on each website. Additionally, we have access to the file browser_new.csv, which contains the browsing history of 500 new households but not their online spending.

Data Preparation

Load the packages rpart, rpart.plot, grf, and DiagrammeR. Read the data sets browser_2006.csv and browser_new.csv. Generate matrices for the outcome, control, as well as id variables for both data sets.
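A minimal sketch of this preparation step follows. The column names id and spend and the helper split_data are assumptions made for illustration; the actual layout of the csv files may differ. The files are only read if they are present in the working directory.

```r
# Load the packages named above; skip any that are not installed.
pkgs <- c("rpart", "rpart.plot", "grf", "DiagrammeR")
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) library(p, character.only = TRUE)
}

# Hypothetical helper: split a data frame into id, outcome, and control matrices.
# The column names "id" and "spend" are assumptions about the file layout.
split_data <- function(df, id_col = "id", outcome_col = "spend") {
  controls <- setdiff(names(df), c(id_col, outcome_col))
  list(
    id = as.matrix(df[, id_col, drop = FALSE]),
    # The new households have no outcome column, so y is NULL for them.
    y  = if (outcome_col %in% names(df)) as.matrix(df[, outcome_col, drop = FALSE]) else NULL,
    x  = as.matrix(df[, controls, drop = FALSE])
  )
}

# Read the two data sets only if the files are available.
if (file.exists("browser_2006.csv") && file.exists("browser_new.csv")) {
  old <- split_data(read.csv("browser_2006.csv"))
  new <- split_data(read.csv("browser_new.csv"))
}
```

Keeping the controls as a numeric matrix is convenient later, because grf expects a matrix of covariates rather than a data frame with a formula.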

Exercise 1: Data Description

a) How much is the average online spending in 2006?

b) Generate a variable for log online spending. Plot the cumulative distributions of online spending and log online spending.

c) Randomly partition the 2006 data into a training sample and a test sample of equal size. For this purpose, generate a variable that indicates the rows that are included in the training sample (using the sample command).
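The three steps of Exercise 1 could be sketched as follows. The spending values are simulated here so that the code runs on its own; with the real data, spend would be the column from browser_2006.csv. The seed and the variable names are arbitrary choices.

```r
set.seed(1001)  # arbitrary seed, used only for reproducibility

# Simulated stand-in for the spend column (assumption: not the real data).
spend <- exp(rnorm(1500, mean = 6, sd = 1.5))

# a) Average online spending.
mean_spend <- mean(spend)

# b) Log transformation and empirical cumulative distribution functions.
log_spend <- log(spend)
plot(ecdf(spend), main = "Online spending", xlab = "US dollars")
plot(ecdf(log_spend), main = "Log online spending", xlab = "log US dollars")

# c) Draw half of the row numbers at random and flag them as training rows.
n <- length(spend)
train_rows <- sample(seq_len(n), size = floor(n / 2))
in_train <- seq_len(n) %in% train_rows
```

The logical vector in_train can then be used to index both the outcome and the control matrix, with !in_train selecting the remaining half.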

Exercise 2: Trees

a) Build in the training sample a shallow tree (terminal leaves should contain at least 150 observations) with the outcome log online spending. Plot the structure of the shallow tree.

b) Build in the training sample a deep tree (terminal leaves should contain at least 10 observations) with the outcome log online spending. Plot the cross-validated MSE.

c) Determine the optimal number of terminal leaves.

d) Prune the deep tree and plot the structure of the pruned tree.

e) Calculate the $\mathbf{R^2}$ in the test sample.
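One possible way to work through Exercise 2 with rpart is sketched below. The data are simulated (20 hypothetical site columns and an invented outcome), so the numbers say nothing about the Comscore data. Setting cp = 0 so that only the leaf-size restriction limits tree growth is a choice made for this sketch.

```r
library(rpart)

set.seed(1001)  # arbitrary seed
n <- 1500; p <- 20
# Simulated stand-in for the browsing shares and log spending (assumption).
x <- matrix(runif(n * p), n, p, dimnames = list(NULL, paste0("site", 1:p)))
log_spend <- 5 + 2 * x[, 1] - 1.5 * (x[, 2] > 0.5) + rnorm(n)
dat <- data.frame(log_spend = log_spend, x)
in_train <- seq_len(n) %in% sample(seq_len(n), n / 2)

# a) Shallow tree: at least 150 observations per terminal leaf.
shallow <- rpart(log_spend ~ ., data = dat[in_train, ], method = "anova",
                 control = rpart.control(minbucket = 150, cp = 0))
if (requireNamespace("rpart.plot", quietly = TRUE)) rpart.plot::rpart.plot(shallow)

# b) Deep tree: at least 10 observations per leaf, 10-fold cross-validation.
deep <- rpart(log_spend ~ ., data = dat[in_train, ], method = "anova",
              control = rpart.control(minbucket = 10, cp = 0, xval = 10))
plotcp(deep)  # cross-validated relative error against tree size

# c) Tree size with the smallest cross-validated error.
cp_tab <- deep$cptable
best <- which.min(cp_tab[, "xerror"])
n_leaves <- unname(cp_tab[best, "nsplit"]) + 1

# d) Prune the deep tree back to that size.
pruned <- prune(deep, cp = cp_tab[best, "CP"])
if (requireNamespace("rpart.plot", quietly = TRUE)) rpart.plot::rpart.plot(pruned)

# e) Out-of-sample R^2 = 1 - MSE / variance of the outcome in the test sample.
test <- dat[!in_train, ]
pred <- predict(pruned, newdata = test)
r2 <- 1 - mean((test$log_spend - pred)^2) /
  mean((test$log_spend - mean(test$log_spend))^2)
```

Note that plotcp shows the cross-validated error relative to the root-node error rather than the raw MSE; multiplying by the root-node MSE recovers the latter.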

Exercise 3: Random Forests

a) Build in the training sample a random forest to predict log online spending. The forest should contain 1000 trees. Each tree should use a 50% subsample of the training data, 1/3 of the covariates, and restrict the min.node.size to 100.

b) Plot a tree of the forest.

c) Plot the variable importance. Why do we have to be cautious when interpreting the variable importance?

d) Use the forest to predict log online spending in the test sample. Evaluate the performance of the random forest using the $\mathbf{R^2}$.

e) Draw an area under the curve (AUC) graph with regard to the number of trees in the forest.

f) Build a forest with smaller min.node.size (= 5) and test if this improves the $\mathbf{R^2}$ in the test sample.

g) Use the data in browser_new.csv, which contain the browsing behaviour of new potential customers. Predict the online spending in the new data using the prediction model that performs best in the test sample. Save the ids and the predicted spending of the new customers in a csv file. These predictions might help you to target marketing campaigns at the new potential customers with the highest (or lowest) expected online spending.
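Parts a), c), d), f) and g) of Exercise 3 could be approached along the following lines with grf. This is a sketch on simulated data, not a reference solution: the helper names fit_forest and r_squared, the seed, the output file name, and the choice honesty = FALSE (which gives a conventional random forest; the exercise does not specify it) are all assumptions. The forest code only runs when grf is installed.

```r
set.seed(1001)  # arbitrary seed
n <- 1500; p <- 30
# Simulated stand-in for the browsing shares and log spending (assumption).
x <- matrix(runif(n * p), n, p, dimnames = list(NULL, paste0("site", 1:p)))
y <- 5 + 2 * x[, 1] - 1.5 * (x[, 2] > 0.5) + rnorm(n)
in_train <- seq_len(n) %in% sample(seq_len(n), n / 2)

# Out-of-sample R^2: one minus MSE over the variance of the outcome.
r_squared <- function(y, pred) 1 - mean((y - pred)^2) / mean((y - mean(y))^2)

# a) 1000 trees, 50% subsamples, one third of the covariates per split.
fit_forest <- function(min_node) {
  grf::regression_forest(x[in_train, ], y[in_train],
                         num.trees = 1000, sample.fraction = 0.5,
                         mtry = floor(p / 3), min.node.size = min_node,
                         honesty = FALSE, seed = 1001)
}

if (requireNamespace("grf", quietly = TRUE)) {
  forest_100 <- fit_forest(100)

  # b) Extract the first tree; plot(tree_1) draws it when DiagrammeR is installed.
  tree_1 <- grf::get_tree(forest_100, 1)

  # c) Variable importance: one value per covariate.
  importance <- grf::variable_importance(forest_100)

  # d) R^2 in the test sample.
  pred_100 <- predict(forest_100, newdata = x[!in_train, ])$predictions
  r2_100 <- r_squared(y[!in_train], pred_100)

  # f) Same forest with min.node.size = 5.
  forest_5 <- fit_forest(5)
  pred_5 <- predict(forest_5, newdata = x[!in_train, ])$predictions
  r2_5 <- r_squared(y[!in_train], pred_5)

  # g) Predict for simulated "new" households with the better forest and save.
  best <- if (r2_5 > r2_100) forest_5 else forest_100
  x_new <- matrix(runif(500 * p), 500, p, dimnames = list(NULL, colnames(x)))
  out <- data.frame(id = seq_len(500),
                    predicted_log_spend = predict(best, newdata = x_new)$predictions)
  write.csv(out, file.path(tempdir(), "predictions.csv"), row.names = FALSE)
}
```

Because the forest is trained on log spending, the saved predictions are on the log scale; ranking customers by them is unaffected, but reporting dollar amounts would need a back-transformation.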

Useful links: